Hub-Accelerator: Fast and Exact Shortest Path 
Computation in Large Social Networks 



Ruoming Jin† Ning Ruan‡ Bo You† Haixun Wang§
† Kent State University ‡ Google Inc § Microsoft Research Asia

{jin,byou}@cs.kent.edu ningruan@google.com haixunw@microsoft.com 



ABSTRACT

Shortest path computation is one of the most fundamental opera- 
tions for managing and analyzing large social networks. Though 
existing techniques are quite effective for finding the shortest path 
on large but sparse road networks, social graphs have quite differ- 
ent characteristics: they are generally non-spatial, non-weighted, 
scale-free, and they exhibit small-world properties in addition to 
their massive size. In particular, the existence of hubs, those ver- 
tices with a large number of connections, explodes the search space, 
making the shortest path computation surprisingly challenging. In 
this paper, we introduce a set of novel techniques centered around 
hubs, collectively referred to as the Hub-Accelerator framework, to
compute the k-degree shortest path (finding the shortest path between
two vertices if their distance is within k). These techniques
enable us to significantly reduce the search space by either greatly 
limiting the expansion scope of hubs (using the novel distance- 
preserving Hub-Network concept) or completely pruning away the 
hubs in the online search (using the Hub²-Labeling approach). The
Hub-Accelerator approaches are more than two orders of magni- 
tude faster than BFS and the state-of-the-art approximate shortest 
path method Sketch for the shortest path computation. The
Hub-Network approach introduces no additional index cost and only a
light precomputation cost; the index size and index construction
cost of Hub²-Labeling are also moderate and better than or
comparable to those of the approximation indexing method Sketch.

1. INTRODUCTION 

Social networks are becoming ubiquitous and their data volume 
is increasing dramatically. The popular online social network
websites, such as Facebook, Twitter, and LinkedIn, all have hundreds
of millions of active users nowadays. Google's new social network 
Google+ attracted 25 million unique users and was growing at a 
rate of roughly one million visitors per day in the first month after 
launch. Enabling online and interactive query processing of these 
massive graphs, especially to quickly capture and discover the re- 
lationship between entities, is becoming an indispensable compo- 
nent for emerging applications ranging from the social sciences to 
advertisement and marketing research, to homeland security. 

Shortest path computation is one of the most basic yet critical 
problems for managing and querying social networks. The social 
network website LinkedIn pioneered the well-known shortest-path
service "How you're connected to A", which offers a precise de- 
scription of the friendship chain between you and a user A within 
3 steps. Microsoft's Renlifang (EntityCube) [37], which records
over a billion relationships for over 10 million entities (people, lo- 
cations, organizations), allows users to retrieve the shortest path 
between two entities if their distance is less than or equal to 6. The 
newly emerged online application "Six Degrees" [38] provides an
interactive way to demonstrate how you connect to other people in
your Facebook network. In addition, shortest path computation is
also useful in determining trust and discovering friends in online
games [41, 42].

In this paper, we investigate the k-degree shortest path query
($k \le 6$ in general), which can be formally described as: Given
two vertices (users) s and t in a large (social) network, what is the 
shortest path from s to t if their distance is less than or equal to 
k? In all these emerging social network applications, (one) short- 
est path between two users needs to be computed generally only 
if their distance is less than a certain threshold (such as 6). Such 
a focus directly resonates with the small-world phenomenon being 
observed in these massive social networks. For instance, the average
pairwise distance on a large sample of Facebook users [38] has
been shown to be only 5.73. Also, around half the users on Twitter 
are on average 4 steps away from another while nearly everyone is 
5 steps away [39]. Not only are most of the users in large social
networks separated by less than 6 steps, the longer connections or 
paths in social networks are also less meaningful and/or useful. 

Computing the k-degree shortest path in a large social network is
surprisingly challenging, especially when k is relatively large, such
as k = 6. A single BFS (Breadth-First Search) can easily visit
more than a million vertices in 6 steps in a large network with a 
few million vertices. Though existing techniques [20, 21, 34, 31,
3, 17, 15, 30, 32, 23, 14, 35, 4] are very effective for finding
the shortest path on large but sparse road networks, social graphs 
have quite different characteristics. Instead of being spatial, with 
edge weight, and having low vertex degree, social networks are 
generally non-spatial, non-weighted, scale-free (therefore contain- 
ing high-degree hub nodes), and they exhibit small-world proper- 
ties in addition to their massive size. Indeed, due to the difficulty in 
finding the shortest path in social networks, the recent studies [16,
41, 42] all focus on discovering only approximate ones (longer
than the true shortest path). Furthermore, even with the
approximation, the fastest methods, such as Sketch [16], TreeSketch
[16], and RigelPath [42], still need tens or hundreds of
milliseconds ($10^{-3}$ second) to compute an approximate shortest
path in a social network with a few million vertices.

The central problem of shortest path computation in massive
social networks comes from hubs: those vertices with a large number
of connections. The number of hubs may be small compared to 
the total network size; however, they appear in the close neigh- 
borhood of almost any vertex. Indeed, hubs play a critical role in 
the small-world (social) networks; they serve as the common me- 
diators linking the shortest path between vertices, just like the hub 
cities in the small-world network of airline flights. In fact,
theoretical analysis shows that a small number of hubs (due to the
power law degree distribution) significantly shortens the distance
between vertices and makes networks "ultra-small" [8]. However, hubs
are the key contributing factor to the search-space explosion.
Assuming a hub has 5,000 friends and normal persons have about 100
friends, then a two-step BFS from the hub will visit ~500,000
vertices; in the Twitter network, some vertices (celebrities) have
more than 10 million followers, so a reverse one-step BFS (from 
that vertex to its followers) is already too expensive. Thus, hubs 
are at the center of the problem: shortest paths do not exist without 
them; but they make the discovery extremely hard. Can we disen- 
tangle the love-hate relationship between shortest path and hubs? 
Can we make hubs more amicable for shortest path computation? 

In this paper, we provide a positive answer to these challenging 
problems on shortest path computation in massive social graphs. 
We introduce a set of novel techniques centered around hubs,
collectively referred to as the Hub-Accelerator framework. These
techniques enable us to significantly reduce the search space by
either greatly limiting the expansion scope of hubs (using the novel
distance-preserving hub-network concept) or completely pruning
away the hubs in the online search (using the Hub²-Labeling
approach). The Hub-Accelerator approaches are on average more
than two orders of magnitude faster than BFS and the
state-of-the-art approximate shortest path methods, including Sketch
[16], TreeSketch [16], and RigelPath [42]. The Hub-Network approach
introduces no additional index cost and only a light precomputation
cost; the index size and index construction cost of Hub²-Labeling
are also moderate and better than or comparable to those of the
approximation indexing method Sketch. We note that though the shortest path
computation has been extensively studied, most of the studies only
focus on road networks [20, 21, 34, 31, 3, 17, 15, 30, 32, 23, 14,
35, 4, 2, 1] or approximate shortest path (distance) computation on
massive social networks [16, 42]. To the best of our knowledge, this is
the first work explicitly addressing the exact shortest path
computation in these networks. The Hub-Accelerator techniques are also
novel and the distance-preserving subgraph (hub-network) discov- 
ery problem itself is of both theoretical and practical importance 
for graph mining and management. 

2. RELATED WORK 

In the following, we will review the existing methods on short- 
est path computation, especially those related to social networks. 
Throughout our discussion, we use n and m to denote the number 
of nodes and edges in the graph G, respectively. 
Online Shortest Path Computation: One of the most well-known
methods for shortest path computation is Dijkstra's algorithm [12].
It computes the single source shortest paths in a weighted graph
and can be implemented in $O(m + n \log n)$ time. If the graph is
unweighted (as are many social networks), a Breadth-First Search
(BFS) procedure can compute the shortest path in $O(m + n)$ time.
However, it is prohibitively expensive to apply these methods to a social
network with millions of vertices, even when limiting the search 
depth to 6 steps. First, the average degree in the social network 
is relatively high. For instance, each user in Facebook on average 
has about 130 friends. A straightforward BFS would easily scan 
one million vertices within 6 steps. A simple strategy is to employ 
bidirectional search to reduce the search space. Second, due to the 
existence of hubs and the small-world property, a large number of 
hubs may be traversed in bidirectional BFS (even within three steps 
of the start s or end t of the shortest path query). For instance, in 
the Orkut graph (a frequently used benchmarking social network), 
which consists of over 3 million vertices and 220 million edges, a 
bidirectional BFS still needs to access almost 200K vertices per
query while traditional BFS needs to access almost 1.6 million
vertices per query.
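
To make the comparison above concrete, the following is a minimal Python sketch (not the authors' implementation) of a k-bounded BFS and a level-synchronous bidirectional BFS on an unweighted, undirected graph. The adjacency-dict representation and the function names `bounded_bfs` and `bounded_bidirectional_bfs` are assumptions made for illustration. Both functions return the distance (or `None` if it exceeds k) together with the number of vertices they touched.

```python
from collections import deque


def bounded_bfs(adj, s, t, k):
    """Plain BFS from s that never expands vertices at depth k."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u], len(dist)
        if dist[u] == k:
            continue  # do not look beyond k hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None, len(dist)


def bounded_bidirectional_bfs(adj, s, t, k):
    """Expand one full BFS level at a time, always from the smaller frontier."""
    if s == t:
        return 0, 1
    dist = ({s: 0}, {t: 0})      # distances from s and from t
    frontier = ([s], [t])
    depth = [0, 0]
    while frontier[0] and frontier[1] and depth[0] + depth[1] < k:
        side = 0 if len(frontier[0]) <= len(frontier[1]) else 1
        mine, other = dist[side], dist[1 - side]
        found, nxt = None, []
        for u in frontier[side]:
            for v in adj[u]:
                if v in other:
                    # the two searches meet on edge (u, v)
                    cand = mine[u] + 1 + other[v]
                    found = cand if found is None else min(found, cand)
                elif v not in mine:
                    mine[v] = mine[u] + 1
                    nxt.append(v)
        if found is not None:
            return found, len(dist[0]) + len(dist[1])
        frontier = (nxt, frontier[1]) if side == 0 else (frontier[0], nxt)
        depth[side] += 1
    return None, len(dist[0]) + len(dist[1])
```

On a graph with high-degree vertices, the second return value of the bidirectional variant is typically much smaller than that of the plain BFS, which mirrors the Orkut figures quoted above; the exact ratio depends on the graph.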



Shortest Path Computation on Road Networks: Computing shortest
paths on road networks has been widely studied [20, 21, 34, 31, 3,
17, 15, 30, 32, 23, 14, 35, 4, 2, 1]. Here we provide only a short
review. A more detailed review on this topic can be found in [11].
Several early studies [20, 21, 34], such as HEPV [20] and HiTi [21],
utilize the decomposition of a topological map to speed up shortest
path search. Recently, a variety of techniques [11], such as A* [15],
Arc-flag (directing the search towards the goal) [4], highway
hierarchies (building shortcuts to reduce search space) [17, 31],
transit node routing (using a small set of vertices to relay the
shortest path computation) [3], and utilizing spatial data structures
to aggressively compress the distance matrix [30, 32], have been
developed. However, the effectiveness of these approaches relies on
the essential properties of road networks, such as being almost
planar, low vertex degree, weighted, spatial, and having a
hierarchical structure [16]. As we mentioned before, social networks
have different properties: they are non-spatial, unweighted,
scale-free (existence of hubs), and exhibit small-world properties.
For instance, those techniques utilizing spatial properties (triangle
inequality) for pruning the search space immediately become
infeasible in social networks. Also, the high vertex degree (hubs)
easily leads to the explosion of the search space.

Theoretical Distance Labeling and Landmarking: There have
been several studies on estimating the distance between any
vertices in large (social) networks [26, 9, 16, 41, 42, 29]. These
methods in general belong to distance labeling [13], which assigns
each vertex u a label (for instance, a set of vertices and the
distances from u to each of them) and then estimates the shortest path
distance between two vertices using the assigned labels. The seminal
work, referred to as the distance oracle [36], by Thorup and Zwick
shows a $(2k-1)$-multiplicative distance labeling scheme (the
approximate distance is no more than $2k-1$ times the exact distance),
for each integer $k \ge 1$, with labels of $O(n^{1/k} \log^2 n)$ bits.
However, as Potamias et al. [26] argued, for practical purposes, even
$k = 2$ is unacceptable (due to the small-world phenomenon).
Recently, Sarma et al. [9] studied Thorup and Zwick's distance oracle
method on real Web graphs and found that it can provide fairly
accurate estimation.

The pioneering 2-hop distance method by Cohen et al. [7] provides
exact distance labeling on directed graphs (very similar to the
2-hop reachability indexing). Specifically, each vertex u records a
list of intermediate vertices $L_{out}(u)$ it can reach along with their
(shortest) distances, and a list of intermediate vertices $L_{in}(u)$ which
can reach it along with their distances. To find the distance from
u to v, the 2-hop method simply checks all the common intermediate
vertices between $L_{out}(u)$ and $L_{in}(v)$ and chooses the vertex p
such that $dist(u, p) + dist(p, v)$ is minimized over all
$p \in L_{out}(u) \cap L_{in}(v)$. However, the computational cost to
construct an optimal 2-hop labeling is prohibitively expensive [33, 18].
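
The 2-hop query step described above can be sketched as follows. This is an illustrative Python fragment, not code from Cohen et al.; it assumes each label is stored as a dictionary mapping an intermediate vertex to its distance, and the name `two_hop_distance` is hypothetical. How the labels are constructed (the expensive part) is not shown.

```python
import math


def two_hop_distance(l_out_u, l_in_v):
    """Minimize dist(u, p) + dist(p, v) over intermediates p shared by both labels.

    l_out_u maps p -> dist(u, p); l_in_v maps p -> dist(p, v).
    Returns math.inf when the two labels share no intermediate vertex.
    """
    # Iterate over the smaller label and probe the larger one.
    if len(l_out_u) <= len(l_in_v):
        small, large = l_out_u, l_in_v
    else:
        small, large = l_in_v, l_out_u
    best = math.inf
    for p, d_first in small.items():
        d_second = large.get(p)
        if d_second is not None:
            best = min(best, d_first + d_second)
    return best
```

The query cost is linear in the size of the smaller label, which is why the method is attractive whenever compact labels can be built.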

Several works use landmarks to approximate the shortest path
distance [28, 22, 26, 41, 42, 29]. Here, each vertex precomputes
the shortest distance to a set of landmarks and thus the landmark
approach can be viewed as a special case of 2-hop and distance
labeling where each vertex can record the distance to different
vertices. Potamias et al. [26] investigate the selection of the optimal
set of landmarks to estimate the shortest path distance. Qiao et al.
[29] observe that a globally-selected landmark set introduces too much
error, especially for some vertex pairs with small distance, and so
propose a query-load aware landmark selection method. Zhao et
al. [42] introduce Rigel, which utilizes a hyperbolic space embedding
on top of the landmarks to improve the estimation accuracy.
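
As a rough illustration of the basic landmark idea (not of any specific cited system), the sketch below precomputes BFS distances from a few landmarks on an undirected, unweighted graph and derives triangle-inequality bounds for a query pair. The function names and the adjacency-dict input are assumptions for this example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_landmark_index(adj, landmarks):
    """One BFS per landmark; maps each landmark to its distance table."""
    return {l: bfs_distances(adj, l) for l in landmarks}


def estimate_distance(index, u, v):
    """Return (lower, upper) bounds on d(u, v) from the triangle inequality."""
    lower, upper = 0, float("inf")
    for dist in index.values():
        if u in dist and v in dist:
            upper = min(upper, dist[u] + dist[v])
            lower = max(lower, abs(dist[u] - dist[v]))
    return lower, upper
```

On the path 0-1-2-3-4 with vertex 2 as the only landmark, the pair (0, 4) gets the exact upper bound 4, while the adjacent pair (0, 1) gets the loose upper bound 3. This is the kind of error on close pairs that motivates the query-aware landmark selection mentioned above.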
Approximate Shortest Path Computation in Social Networks:
A few recent studies aim to compute the shortest path in large social
networks. They extend the distance-labeling or the landmarking
approach to approximate the shortest paths. Gubichev et al. propose
Sketch, which generalizes the distance oracle method [36, 9] to
discover the shortest path (not only the distance) in large graphs [16].
They observe that the path lengths are small enough to be
considered as almost constant and therefore store a set of precomputed
shortest paths in addition to the distance labeling. They also
propose several improvements, such as cycle elimination (SketchCE)
and tree-based search (TreeSketch), to boost the shortest path
estimation accuracy. Zhao et al. [42] develop RigelPath to approximate
the shortest path in social networks on top of their distance
estimation method, Rigel. Their basic idea is to use the distance
estimation to help determine the search direction and prune the
search space. Sketch is the fastest approximate shortest path method,
though RigelPath and TreeSketch can be more accurate. In addition,
RigelPath mainly focuses on undirected graphs, while Sketch can handle
both directed and undirected graphs.

Other Recent Progress on Shortest Path Computation: Very
recently, there have been a few studies in the database research
community on shortest path and distance computation. In [40], Wei
develops a tree decomposition indexing structure to find the
shortest paths in an unweighted undirected graph. In [5], a hierarchical
vertex-cover based approach is developed for single-source on-disk
shortest path (distance) computation. In [6], Cheng et al. introduce
the k-reach problem, which provides a binary answer to whether two
vertices are connected within k steps. Also, the k-reach indexing
approach developed in [6] is not scalable and can only handle small
graphs (as it tries to materialize the vertex pairs within a certain
distance threshold). Finally, Jin et al. [19] propose a highway-centric
labeling (HCL) scheme to efficiently compute distance in sparse graphs.
Leveraging highway structure, this distance labeling offers a more
compact index size compared to the state-of-the-art 2-hop labeling,
and is also able to provide both exact and approximate distance
with bounded accuracy. However, it is hard to scale to large social
networks, as real social networks are generally not sparse, which can
lead to expensive index construction cost and large index size.

3. HUB-ACCELERATOR FRAMEWORK 

In this section, we give an overview of the Hub-Accelerator (HA)
framework for the shortest path computation. In the earlier
discussion, we observed a love-hate relationship between shortest
paths and hubs: on one hand, any shortest path likely contains some
hubs, which thus need to be visited in the shortest path search
process; on the other hand, in order to provide fast shortest path
search, we need to avoid a full expansion of hub nodes. We note that
in general, the notion of hubs is rather informal though generally
based on degree; in this paper, we simply refer to the set of
vertices whose degrees are the highest (the top β vertices; β is a
constant and can be specified).

The design of Hub-Accelerator aims to utilize these hubs for
shortest-path computation without fully expanding their
neighborhoods. To achieve this, the following research questions need
to be answered:

1. How can we limit the expansion of hubs during the shortest path
search? A hub may have thousands or even millions of connections
(neighbors); which neighbors should be considered essential
and given high priority in the shortest path search? To address this
question, we formulate the hub-network notion, which captures
a high-level view of the shortest paths and topology between these
hubs. The hub-network can be considered a highway structure
anchored by hubs for routing the shortest paths in a massive social
network. Due to the importance of hubs, most shortest paths
between non-hub vertex pairs may need to go through such a network,
i.e., the starting vertex reaches a hub (as the highway entry), then
travels to another hub (as the highway exit), and finally leaves the
highway reaching the destination. In other words, the hub-network
can be used to limit (or prioritize) the neighbors of hubs; a hub
should only expand within the hub-network.

2. How can we effectively and efficiently utilize the hub-network
for shortest path search? Note that the hub-network captures the
shortest paths between hubs. However, not all shortest paths
between vertices need to go through the hub-network: they may not
contain any hub or they may contain only one hub (in the latter
case, no traversal may be needed in the hub-network). Thus, the
problem is how we can extend the typical bidirectional BFS to
adopt the hub-network for speeding up the shortest path
computation.

3. Can we completely avoid the expansion of hubs? In this way,
even the hub-network becomes unnecessary. But what essential
information should be precomputed? When the number of hubs
is not large, say 10K, then the pair-wise distance matrix between
hubs may be materialized. For 10K hubs, this only costs about
100MB = 10K × 10K bytes (assuming the distance can be held in 8
bits), but additional memory may be needed to recover the shortest
path. Given this, how can bidirectional search take advantage
of such a matrix and what other information may also need to be
precomputed?
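
The memory estimate in the third question is simple arithmetic, which the short Python snippet below spells out. The helper name `hub_matrix_bytes` is made up for this illustration, and a dense (non-symmetric, uncompressed) matrix is assumed.

```python
def hub_matrix_bytes(num_hubs, bits_per_distance=8):
    """Bytes needed for a dense num_hubs x num_hubs distance matrix."""
    return num_hubs * num_hubs * bits_per_distance // 8


size = hub_matrix_bytes(10_000)          # 10K hubs, one byte per distance
print(size, "bytes =", size / 10**6, "MB")  # 100000000 bytes = 100.0 MB
```

Since k-degree queries only need distances up to a small k, fewer than 8 bits per entry would also suffice, and an undirected graph only needs half of the matrix; both are possible refinements rather than anything stated in the text.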

In this work, by investigating and solving these problems, we are 
able to utilize the hubs effectively to accelerate the shortest path 
search while significantly reducing or avoiding the cost of expand- 
ing them. Specifically, we make the following contributions: 
Hub-Network Discovery (Section 4): The concept of hub-network
is at the heart of the Hub-Accelerator framework: given a collection
of hubs, a distance-preserving subgraph seeks to extract a minimal
number of additional vertices and edges from the original graph
so that the distance (and shortest path) between hubs can be
recovered, i.e., their distances in the hub-network are equivalent to
their distances in the original graph. As we mentioned before, the
hub-network serves as the highway in the transportation system to
enable the acceleration of the shortest path search: no hub is
fully expanded (in the original graph); instead, only its neighbors
in the hub-network are expanded. Interestingly, though the
discovery of a distance-preserving subgraph (and hub-network) seems
rather intuitive, the computational aspect of the problem has not
been studied before (despite similar notions being defined in
theoretical graph theory [10]). In Section 4, we show the NP-hardness
of discovering the minimal distance-preserving subgraph and we
develop a fast greedy approach to extract the hub-network (and the
distance-preserving subgraph). Our experimental study shows the
degree of hubs in the hub-network is significantly lower than that
in the original graph; thus the hub-network can limit the expansion
of hubs and enable faster shortest path computation.
Hub-Network based Bidirectional BFS (Section 5): As we
mentioned above, it is nontrivial to incorporate the hub-network into
the bidirectional BFS. In general, if we use the hub-network and
also expand the hubs within the network, then the searches in both 
directions cannot simply be stopped when they meet at a common 
vertex. This is because the hub-network does not capture those 
shortest paths consisting of only one hub. 

Figure 1: Running Example of Hub-Accelerator Framework

Hub²-Labeling (Section 6): In this technique, we further push the
speed boundary for shortest path computation by completely
avoiding expanding any hub. To achieve this, a more expensive though
often affordable precomputation and memory cost is used for faster
online search. It consists of three basic elements: 1) First, instead
of extracting and searching the hub-network, this technique
materializes the distance matrix of those hubs, referred to as the
Hub² matrix. As we mentioned before, even for 10K hubs, the matrix
can be rather easily materialized. 2) Hub-Labeling is introduced
so that each vertex will precompute and materialize a small
number of hubs (referred to as the core-hubs) which are essential for
recovering the shortest path using hubs and the hub-matrix. 3) Given
the Hub² distance matrix and hub-labeling, a faster bidirectional
BFS can be performed to discover the exact k-degree shortest path.
It first estimates a distance upper bound using the distance matrix
and the hub labeling. No hub needs to be expanded during the
bidirectional search, i.e., hub-pruning bidirectional BFS.
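
One plausible reading of the upper-bound step in element 3 is sketched below in Python. It is an assumption, not the paper's stated procedure: each vertex label is taken to map its core-hubs to their distances, the hub matrix is taken to be a nested dictionary of hub-to-hub distances, and the bound is the cheapest route that enters the hubs at one core-hub of s and leaves at one core-hub of t. The name `hub2_upper_bound` is hypothetical.

```python
import math


def hub2_upper_bound(label_s, label_t, hub_matrix):
    """Upper bound on d(s, t) via one core-hub of s and one core-hub of t.

    label_s maps hub -> d(s, hub); label_t maps hub -> d(hub, t);
    hub_matrix[h1][h2] is the precomputed hub-to-hub distance.
    """
    best = math.inf
    for h1, d_s_h1 in label_s.items():
        row = hub_matrix[h1]
        for h2, d_h2_t in label_t.items():
            best = min(best, d_s_h1 + row[h2] + d_h2_t)
    return best
```

If the bound is finite, a hub-free bidirectional search only has to look for paths strictly shorter than it, which is how such a bound could let the search skip hub expansion.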

4. HUB-NETWORK DISCOVERY 

In this section, we formally define the Hub-Network
(Subsection 4.1) and present an efficient approach to discover it
(Subsection 4.2).

To facilitate our discussion, we first introduce the following
notation. Let $G = (V, E)$ be a graph where $V = \{1, 2, \ldots, n\}$ is the
vertex set and $E \subseteq V \times V$ is the edge set. The edge from vertex u
to v is denoted by $(u, v)$, and we use $P(v_0, v_p) = (v_0, v_1, \ldots, v_p)$
to denote a simple path between $v_0$ and $v_p$. The length of a simple
path is the number of edges in the path, denoted by $|P(v_0, v_p)|$.
Given two vertices u and v, their shortest path $SP(u, v)$ is the path
between them with the minimal length. The distance from vertex
u to v is the length of the shortest path $SP(u, v)$ between u and
v, denoted by $d(u, v)$. Note that for a directed graph, the edge
set may contain either $(u, v)$, $(v, u)$, or both. For an undirected
graph, the edge has no direction; in other words, it can be
considered bidirectional, so the edge set contains either both edges
$(u, v)$ and $(v, u)$ or neither of them. In an undirected graph, the
shortest path distance from u to v is equivalent to the one from v to
u, i.e., $d(u, v) = d(v, u)$. The techniques discussed in the paper can be
applied to both undirected and directed graphs; for simplicity, we will
focus on the undirected graph and we will briefly mention how each
technique can be naturally extended to handle directed graphs.

4.1 Distance-Preserving Subgraph and Hub-Network

Intuitively, a hub-network is a minimal subgraph of the original 
G, such that at least one shortest path between two hubs can be 
recovered in the subgraph (the distance is preserved). To formally 
define the hub-network, we first introduce the concept of
distance-preserving subgraph and its discovery.

Definition 1. (Distance-Preserving Subgraph) Given graph
$G = (V, E)$ and a set of vertex pairs $D = \{(u, v)\} \subseteq V \times V$,
a distance-preserving subgraph $G_s = (V_s, E_s)$ of G ($V_s \subseteq V$
and $E_s \subseteq E$) has the following property: for any $(u, v) \in D$,
$d(u, v | G_s) = d(u, v | G)$, where $d(u, v | G_s)$ and $d(u, v | G)$ are the
distances in subgraph $G_s$ and original graph G, respectively.
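
The property in Definition 1 can be checked directly with BFS. The Python sketch below is only an illustration under assumed names (`bfs_distance`, `induced_subgraph`, `is_distance_preserving`) and an adjacency-dict graph; it compares the distance of every pair in D in the subgraph against the original graph.

```python
from collections import deque


def bfs_distance(adj, s, t):
    """Hop distance from s to t in adj, or None if unreachable or absent."""
    if s not in adj or t not in adj:
        return None
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def induced_subgraph(adj, keep):
    """Subgraph of adj induced by the vertex set keep."""
    keep = set(keep)
    return {u: [v for v in adj[u] if v in keep] for u in keep}


def is_distance_preserving(adj, sub_adj, pairs):
    """True iff every pair has the same distance in sub_adj as in adj.

    A pair that is unreachable in both graphs counts as preserved.
    """
    return all(
        bfs_distance(sub_adj, u, v) == bfs_distance(adj, u, v)
        for u, v in pairs
    )
```

For example, on a 6-cycle with D = {(0, 2)}, the subgraph induced by {0, 1, 2} preserves the distance 2, while the subgraph induced by {0, 5, 4, 3, 2} connects the pair only through a path of length 4 and therefore fails the check.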



Given a collection of vertex pairs whose distances need to be
preserved in the subgraph, the subgraph discovery problem aims to
identify a minimal subgraph in terms of the number of vertices (or 
edges). 

Definition 2. (Minimal Distance-Preserving Subgraph (MDPS)
Problem) Given graph $G = (V, E)$ and a set of vertex
pairs $D = \{(u, v)\} \subseteq V \times V$, the minimal distance-preserving
subgraph (MDPS) problem aims to discover a minimal subgraph
$G_s^* = (V_s^*, E_s^*)$ with the smallest number of vertices, i.e.,
$G_s^* = \arg\min_{|V_s|} G_s$, where $G_s = (V_s, E_s)$ is a
distance-preserving subgraph with respect to D.

Once all the vertices $V_s^*$ are discovered, the induced subgraph
$G[V_s^*]$ of G is a candidate minimal subgraph. Note that its edge
set may be further sparsified. However, the edge sparsification
problem with respect to a collection of vertex pairs (equivalent to
the minimal distance-preserving subgraph problem in terms of the
number of edges) is as difficult as the MDPS problem (see the
discussion below), and the number of edges which can be removed is
typically small in an unweighted graph. Thus, we will not explore
further edge reduction in this work.

Given graph $G = (V, E)$ and a set of hubs $H \subseteq V$, let $D_k$
contain all the hub pairs whose distance is no greater than k; then
the hub-network is defined as the minimal distance-preserving
subgraph of $D_k$ in G.

Example 4.1. Figure 1(a) shows the network we will use as
a running example. Figure 1(b) is the corresponding hub-network
with $H = \{4, 6, 8, 12, 17, 18, 19\}$ (degree > 5) when $k = 4$. Since
the pairwise distances between these hubs are all less than 4, $D_4$
contains all the hub pairs, with a total of 21 vertex pairs.
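
The two inputs of the hub-network definition, the hub set H and the pair set D_k, can be computed as in the following Python sketch. It is illustrative only: hubs are taken as the β highest-degree vertices (ties broken by vertex id, an arbitrary choice), vertices are assumed to be comparable integers, and the function names are hypothetical.

```python
from collections import deque


def top_degree_hubs(adj, beta):
    """The beta vertices of highest degree (ties broken by smaller id)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:beta]


def hub_pairs_within_k(adj, hubs, k):
    """D_k: maps each hub pair (h, g), h < g, with d(h, g) <= k to d(h, g)."""
    hub_set = set(hubs)
    pairs = {}
    for h in hubs:
        # BFS from h that stops expanding at depth k
        dist = {h: 0}
        queue = deque([h])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for g in hub_set:
            if g in dist and h < g:
                pairs[(h, g)] = dist[g]
    return pairs
```

With 7 hubs that are all within distance k of each other, as in Example 4.1, `hub_pairs_within_k` would return 7 · 6 / 2 = 21 pairs.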

Note that an alternative approach is to build a weighted
hub-network which explicitly connects the hub pairs: for instance, if any
other hub lies on a shortest path between two hubs, an edge can be
added to directly link them. Indeed, most of the existing studies 
have adopted a similar approach to build and utilize some highway 
structure (but they target mainly road networks, which are rather 
sparse). However, this approach can lead to a number of problems
when searching a massive social network: 1) Such a hub-network
would be weighted and could be dense (many new edges may need 
to be added between hubs) and to search through it, Dijkstra's al- 
gorithm (or its variant) must be utilized and would be slower than 
BFS (because of using the priority queue). Higher edge density ex- 
acerbates this slowdown. 2) Bidirectional BFS is typically used to 
search an unweighted network and could be adopted to search the 
remaining network (excluding the hub-network). However, com- 
bining bidirectional BFS with Dijkstra's can be rather difficult; 3) 
Significant memory may be needed to record such a hub-network as 
it is rather dense. Moreover, to recover the shortest path, additional 
information has to be recorded for each added new edge. Consid- 
ering these issues, we utilize the distance-preserving subgraph as 
the hub-network, which does not induce additional memory cost, 
and can naturally support (bidirectional) BFS. Note that in
Sections 5 and 6, we will study how to use more memory for higher
query performance (without involving the difficulty of weighted 
hub-network). 

To discover the hub-network in a massive social network, we 
need a fast solution for the Minimal Distance-Preserving Subgraph 
(MDPS) problem. However, finding the exact optimal solution is 
hard. 

Theorem 1. Finding the minimal distance-preserving subgraph
of a collection D of vertex pairs in a graph is an NP-hard problem.



Proof Sketch: We reduce the set-cover decision problem to the
decision version of the minimal distance-preserving subgraph
problem. In the set-cover decision problem, let U be the ground set and
let $\mathcal{C}$ record all the candidate sets, where $C \subseteq U$ for any
candidate set $C \in \mathcal{C}$. The set-cover decision problem asks whether
there are K or fewer candidate sets in $\mathcal{C}$ such that $\bigcup_i C_i = U$.

Now we construct the following MDPS instance based on a set
cover instance: consider a tripartite graph
$G = (X \cup Y \cup Z, E_{XY} \cup E_{YZ})$ where the vertices in X and Z have
one-to-one correspondence to the elements in the ground set U, and
the vertices in Y one-to-one correspond to the candidate sets in
$\mathcal{C}$. For simplicity, let each $u \in U$ correspond to $x_u \in X$
($z_u \in Z$) (vertex $x_u$ ($z_u$) corresponds to element u); and let
each $C \in \mathcal{C}$ correspond to $y_C \in Y$ (vertex $y_C$ corresponds
to candidate set C). Then, the edge set $E_{XY}$ ($E_{YZ}$) contains all
the edges $(x_u, y_C)$ ($(y_C, z_u)$) if element u belongs to the candidate
set C. Note that the tripartite graph can be considered symmetric
($X = Z$ and $E_{XY} = E_{YZ}$).

We claim that the set-cover decision problem is satisfiable if and
only if the following MDPS problem is true: there is a subgraph $G_s$
with $2|U| + K$ vertices to cover the shortest path distance of $|U|$
vertex pairs $(x_u, z_u)$, $u \in U$.

The proof of this claim is as follows. Assume the set-cover
problem is satisfiable, and let $C_1, \ldots, C_k$ ($k \le K$) be the k
candidate sets which cover the ground set, i.e., $\bigcup C_i = U$. Let
$Y_C$ include all the vertices in Y corresponding to $C_1, \ldots, C_k$.
It is easy to observe that the induced subgraph $G[X \cup Y_C \cup Z]$ can
recover the distances of all $|U|$ pairs $(x_u, z_u)$, $u \in U$. Note that
their distances in the original graph G and the induced subgraph
$G[X \cup Y_C \cup Z]$ are all equal to 2.

From the other direction, let $G_s$ be the subgraph with $2|U| + K$
vertices which recovers the distances of these $|U|$ pairs. Since the
vertices in the pairs have to be included in the subgraph (otherwise,
the distance cannot be explicitly recovered), the additional K
vertices can only come from the vertex set Y (there are $2|U|$ vertices
in the vertex pairs from X and Z). Note that the distance of
$(x_u, z_u)$ in the original graph is 2 and to recover that, a vertex
$y_C$ in Y has to appear in the subgraph so that both $(x_u, y_C)$ and
$(y_C, z_u)$ are in the subgraph (and in the original graph). This
indicates the corresponding candidate set C covers element u. Since
there are at most K vertices from Y in the subgraph, at most K
candidates are needed to cover the whole ground set U. □
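
The construction used in the proof sketch can be replayed on a toy set-cover instance. The Python sketch below is an illustration only; vertex names such as `("x", u)` and all function names are assumptions. It builds the tripartite graph and tests whether keeping X, Z and the Y-vertices of a chosen family of sets keeps every pair $(x_u, z_u)$ at distance 2.

```python
from collections import deque


def build_mdps_instance(universe, candidates):
    """Tripartite graph X-Y-Z and the pairs (x_u, z_u) from a set-cover instance."""
    adj = {}
    for u in universe:
        adj[("x", u)] = set()
        adj[("z", u)] = set()
    for name, members in candidates.items():
        y = ("y", name)
        adj[y] = set()
        for u in members:
            # y_C is adjacent to both x_u and z_u whenever u is in C
            for end in (("x", u), ("z", u)):
                adj[y].add(end)
                adj[end].add(y)
    pairs = [(("x", u), ("z", u)) for u in universe]
    return adj, pairs


def distance_within(adj, keep, s, t):
    """BFS distance from s to t using only vertices in keep (None if unreachable)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        a = queue.popleft()
        if a == t:
            return dist[a]
        for b in adj[a]:
            if b in keep and b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return None


def chosen_sets_preserve_pairs(universe, candidates, chosen):
    """True iff G[X ∪ Y_chosen ∪ Z] keeps every (x_u, z_u) at distance 2."""
    adj, pairs = build_mdps_instance(universe, candidates)
    keep = {v for v in adj if v[0] in ("x", "z")} | {("y", c) for c in chosen}
    return all(distance_within(adj, keep, s, t) == 2 for s, t in pairs)
```

On this toy construction, the check succeeds exactly when the chosen sets cover the universe, which is the equivalence the claim in the proof relies on.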

Based on a similar reduction, we can also prove that finding the
minimal distance-preserving subgraph in terms of the number of
edges is also an NP-hard problem. For simplicity, we will not
further explore this alternative in the paper.

4.2 Algorithm for Hub-Network Discovery 

In this subsection, we discuss an efficient approach for
discovering the distance-preserving subgraph and the hub-network. To
simplify our discussion, we focus on extracting the hub-network,
though the approach is directly applicable to any collection of
vertex pairs (and thus to the general distance-preserving subgraph).
Recall that in the hub-network discovery problem, we are given a set
H of hubs and a collection D of hub pairs whose distance is no more
than k (for k-degree shortest path search), and the goal is to
recover the distances for the pairs in D using a minimal
(distance-preserving) subgraph.

To tackle the hub-network (and the distance-preserving subgraph)
discovery efficiently, we make the following simple observation. For
any vertex pair (x, y) in D, if there is another hub z such that
d(x, y) = d(x, z) + d(z, y), then we refer to the vertex pair (x, y)
as a composite pair; otherwise, it is a basic pair, i.e., no shortest
path connecting x and y contains another hub in H. Let D_b ⊆ D be the
set of basic pairs. Given this, it is easy to see that if a subgraph
can recover the distances of all the vertex pairs in D_b, then it is
a distance-preserving subgraph of D (and thus the hub-network). This
indicates that we only need to focus on the basic pairs (D_b), as the
distances of composite pairs can be directly recovered using the
paths between basic pairs.
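As an illustration of the definition only (the paper's own procedure, described next, recognizes basic pairs during a single BFS per hub), the split into basic and composite pairs can be computed by brute force from per-hub BFS distances. The sketch below assumes an undirected graph stored as an adjacency dict; the names `bfs_all` and `split_pairs` and the toy graph are hypothetical.

```python
from collections import deque

def bfs_all(adj, src):
    """Unbounded BFS distances from src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def split_pairs(adj, hubs, k):
    """Classify ordered hub pairs within distance k as basic or composite."""
    dist = {h: bfs_all(adj, h) for h in hubs}
    basic, composite = set(), set()
    for x in hubs:
        for y in hubs:
            if x == y or y not in dist[x] or dist[x][y] > k:
                continue
            # (x, y) is composite if some third hub z lies on a shortest x-y path.
            via_hub = any(
                z not in (x, y)
                and z in dist[x] and y in dist[z]
                and dist[x][z] + dist[z][y] == dist[x][y]
                for z in hubs
            )
            (composite if via_hub else basic).add((x, y))
    return basic, composite

# Toy graph: A - 1 - B - C, where A, B and C are hubs.
adj = {"A": {1}, 1: {"A", "B"}, "B": {1, "C"}, "C": {"B"}}
basic, composite = split_pairs(adj, {"A", "B", "C"}, k=3)
```

Here (A, C) is composite because hub B lies on its only shortest path, so preserving the paths of the basic pairs (A, B) and (B, C) already preserves d(A, C).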

Considering this, at a high level, the hub-network discovery
algorithm performs a BFS-type traversal from each hub h, and it
accomplishes two tasks: 1) during the BFS, all basic pairs including
h, i.e., (h, v), v ∈ H, should be recognized and collected; and
2) once a basic pair (h, v) is identified, the algorithm selects a
"good" shortest path which consists of the minimal number of "new"
vertices (not yet included in the hub-network). In other words, as
we traverse the graph from each hub, we gradually augment the
hub-network with new vertices to recover the distance (shortest
path) of the newly found basic pairs.
Recognizing basic pairs: To quickly determine whether (h, v) is a
basic pair during the BFS process starting from hub h, we utilize
the following observation: Let vertex y lie on a shortest path from
hub h to non-hub vertex v with distance d(h, y) = d(h, v) − 1 (i.e.,
y is one hop closer to h than v). If there is a hub h' appearing in
a shortest path from h to y (h' and y may not be distinct), then h'
definitely lies on a shortest path from h to v, and (h, v) is a
composite pair (not a basic pair). Based on this observation, we
simply maintain a binary flag b(v) to denote whether there is another
hub appearing in a shortest path between h and v. Specifically, its
update rule is as follows: b(v) = 0 (not a basic pair) if v itself is
a hub (its flag is cleared once the pair (h, v) has been examined) or
if b(y) = 0, where y is v's parent in the BFS, i.e., d(h, y) =
d(h, v) − 1 and d(y, v) = 1. Thus, during the BFS traversal, when we
visit vertex v, the flag b(v) = 1 (true) means that no other hub lies
on any shortest path between h and v, and we are able to recognize
(h, v) as a basic pair.
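The flag propagation can be sketched as a level-by-level BFS. This is a simplified illustration under stated assumptions, not the paper's exact queue discipline: the graph is an undirected adjacency dict, the function name `basic_pairs_from` is hypothetical, and processing whole levels at once guarantees that every parent's flag is final before its children are examined.

```python
def basic_pairs_from(adj, hubs, h, k):
    """Return the hubs v such that (h, v) is a basic pair with d(h, v) <= k,
    together with the final flags b."""
    level, b = {h: 0}, {h: 1}
    frontier, basic = [h], []
    for depth in range(1, k + 1):
        nxt = []
        for y in frontier:
            for v in adj[y]:
                if v not in level:
                    level[v], b[v] = depth, 1
                    nxt.append(v)
                # A child inherits b = 0 from any parent one level above it.
                if level[v] == depth and b[y] == 0:
                    b[v] = 0
        for v in nxt:
            if v in hubs:
                if b[v] == 1:
                    basic.append(v)  # no other hub on any shortest h-v path
                b[v] = 0             # descendants of a hub cannot be basic
        frontier = nxt
    return basic, b

# Toy graph: A - 1 - B - C with hubs A, B, C.
chain = {"A": {1}, 1: {"A", "B"}, "B": {1, "C"}, "C": {"B"}}
chain_basic, chain_flags = basic_pairs_from(chain, {"A", "B", "C"}, "A", 3)

# Diamond: h - a - v and h - g - v, where g is a hub on one shortest path.
diamond = {"h": {"a", "g"}, "a": {"h", "v"}, "g": {"h", "v"}, "v": {"a", "g"}}
diamond_basic, diamond_flags = basic_pairs_from(diamond, {"h", "g", "v"}, "h", 2)
```

In the diamond, v has one hub-free shortest path through a, but because another shortest path passes through hub g, the pair (h, v) is composite and its flag ends at 0.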

Selecting a "good" shortest path between basic pairs: To select
a good shortest path for a basic pair (h, v), a basic measure is the
number of "new vertices" that need to be added to the hub-network. As
a greedy criterion, the fewer vertices that need to be added, the
better the path. To compute this, over all shortest paths from the
starting point h to v, a score f records the maximal number of
vertices which are already in the hub-network. This measure can be
easily maintained incrementally. Simply speaking, its update rule is
as follows: f(v) = max_u f(u) + 1 if v itself is in the hub-network,
and f(v) = max_u f(u) otherwise, where u ranges over v's parents in
the BFS (a shortest path from h to v goes through u, and u directly
links to v). Also, vertex v records the parent u which has the
maximal f, for tracking such a shortest path (with the maximal number
of vertices in the hub-network). Finally, we note that a score f
needs to be maintained only for vertices v with b(v) = 1, i.e., when
no shortest path between h and v goes through any other hub.
Otherwise, v and its descendants cannot produce any basic pairs.

Overall Algorithm: The outline of this BFS-based procedure for
discovering the hub-network is described in Algorithm 1. Here H*
is the set recording the vertices in the hub-network. Initially,
H* = H, and new vertices are then added during the processing. Note
that in the queue for BFS traversal (Line 3), we always visit first
those vertices with b(u) = 0, i.e., they and any of their descendants
(in the BFS traversal) will not form a basic pair, and thus the score
f does not need to be maintained for them. Once a hub u is visited
and it still has b(u) = 1, (h, u) is a basic pair (Line 5); we
extract the shortest path which has the maximal number of vertices
in the hub-network and add the new vertices to H* (Line 6). Now,
since the descendants of this hub (in the BFS traversal) will not
form a basic pair, we simply change its flag to false, i.e.,
b(u) = 0 (Line 7). Also, since we are only interested in shortest
paths within k hops, we do not expand any vertex whose distance to h
is k (Lines 9-11). Before we expand the neighbors of u, we also need
to update its f score based on whether u itself is in the
hub-network (Lines 12-14).

The complete expansion of a vertex u is from Line 15 to Line 25. We
visit each of its neighbors v. If v has not been visited, we add it
to the queue for future visiting (Lines 16-17). Then we perform the
incremental update of flag b(v) and score f(v). Flag b(v) is turned
off if b(u) = 0 (Lines 19-20). Otherwise, if f(u) is larger than
f(v), i.e., the shortest path from h to v through u has the largest
number of vertices so far in the hub-network, vertex v records u as
its parent (for shortest path tracking) and f(v) is updated (Lines
21-22). This procedure is invoked for each hub in H.

Algorithm 1 BFSExtraction(G = (V, E), h, H, H*)

 1: b(u) ← 1; f(u) ← 0 for each vertex u;
 2: level(h) ← 0; Q ← {h} {queue for BFS};
 3: while Q ≠ ∅ {vertices with b(u) = 0 visited first at each level} do
 4:   u ← Q.pop();
 5:   if u ∈ H and level(u) ≥ 1 and b(u) = 1 {basic pair} then
 6:     extract shortest path SP(h, u) with maximal f(u) and add to H*
 7:     b(u) ← 0 {all later extensions will become false}
 8:   end if
 9:   if level(u) = k {no expansion beyond level k for k-degree shortest path} then
10:     continue;
11:   end if
12:   if b(u) = 1 and u ∈ H* then
13:     f(u) ← f(u) + 1 {increase f}
14:   end if
15:   for all v ∈ neighbor(u) {(u, v) ∈ E; expanding u} do
16:     if v is not visited then
17:       add v to queue Q;
18:     else if level(v) = level(u) + 1 then
19:       if b(u) = 0 {update b} then
20:         b(v) ← 0;
21:       else if b(v) = 1 and f(u) > f(v) {update f} then
22:         f(v) ← f(u) and parent(v) ← u;
23:       end if
24:     end if
25:   end for
26: end while

Figure 2: Incremental maintenance of flag b and score f. (a) Flag b(v); (b) score f(v).

Example 4.2. Figure 2 illustrates the flag b and score f in the
BFS process. Here the vertices h, 2, 4, 9, and 11 are hubs. In
Figure 2(a), (h, 2), (h, 4), and (h, 11) are basic pairs; the flag b
changes from b = 1 originally to b = 0 (Lines 5-7). After the flag b
of 2, 4, and 11 changes to false (b = 0), all their descendants in
the BFS traversal become false. For instance, the flag b of vertex 5
is false as it is also considered a descendant of hub 2. In
Figure 2(b), the shaded vertex 3 indicates that it is already
included in the hub-network (3 ∈ H*). Therefore, vertex 11 points to
vertex 8 (parent(11) = 8 and parent(8) = 3), as its f score is higher
than that of vertex 6.
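The whole procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the graph is an undirected adjacency dict with sortable vertex ids, the BFS is processed one level at a time (instead of ordering a single queue by flag), and the score f of an unvisited vertex starts at the sentinel -1 so that the first eligible parent is always recorded. The names `bfs_extraction` and `discover_hub_network` and the toy graph are hypothetical.

```python
def bfs_extraction(adj, h, hubs, hub_net, k):
    """Grow hub_net (H*) in place so that it preserves d(h, v) for every
    basic pair (h, v) with d(h, v) <= k."""
    level, b, f, parent = {h: 0}, {h: 1}, {h: 0}, {h: None}
    frontier = [h]
    for depth in range(k + 1):
        for u in frontier:
            if u in hubs and depth >= 1 and b[u] == 1:
                # (h, u) is a basic pair: add the recorded shortest path to H*.
                w = parent[u]
                while w is not None:
                    hub_net.add(w)
                    w = parent[w]
                b[u] = 0  # descendants of u cannot form basic pairs with h
        if depth == k:
            break  # no expansion beyond level k
        nxt = []
        for u in frontier:
            if b[u] == 1 and u in hub_net:
                f[u] += 1  # u itself is already in the hub-network
            for v in adj[u]:
                if v not in level:
                    level[v] = depth + 1
                    b[v], f[v], parent[v] = 1, -1, None
                    nxt.append(v)
                if level[v] != depth + 1:
                    continue  # (u, v) is not on a shortest path from h
                if b[u] == 0:
                    b[v] = 0
                elif b[v] == 1 and f[u] > f[v]:
                    f[v], parent[v] = f[u], u
        frontier = nxt

def discover_hub_network(adj, hubs, k):
    """Invoke the extraction from every hub and return H*."""
    hub_net = set(hubs)
    for h in sorted(hubs):
        bfs_extraction(adj, h, hubs, hub_net, k)
    return hub_net

# Toy graph with hubs 0, 5 and 9.
edges = [(0, 1), (1, 5), (0, 2), (2, 3), (3, 5), (5, 6), (6, 9),
         (0, 7), (7, 8), (8, 9)]
adj = {}
for a, c in edges:
    adj.setdefault(a, set()).add(c)
    adj.setdefault(c, set()).add(a)
hub_network = discover_hub_network(adj, {0, 5, 9}, k=3)
```

In this toy graph the three hub pairs are all basic (d(0, 5) = 2, d(5, 9) = 2, d(0, 9) = 3), so the sketch adds vertices 1, 6, 7 and 8, while vertices 2 and 3, which lie only on a longer 0-5 path, stay outside H*.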



Theorem 2. If we invoke Algorithm 1 for each h ∈ H, then
the induced subgraph G[H*] is a hub-network of H with respect to
the k-degree shortest path.

Proof Sketch: The correctness of the algorithm can be derived
from the following two observations: 1) for any basic pair (h, u)
with distance no more than k, there is at least one shortest path
in G[H*], as the algorithm explicitly extracts a shortest path and
adds all its vertices to H*; 2) any composite pair (h, h') with
distance no more than k can always be represented as a sequence of
basic pairs, each of which has at least one shortest path in G[H*].
Thus, for any hub pair (h, h') with distance no more than k, their
distance (at least one shortest path) is preserved in the induced
subgraph G[H*]. □

The computational complexity of hub-network discovery as described
in Algorithm 1 is basically equivalent to that of a simple BFS
procedure. The overall procedure takes O(Σ_{h∈H} (N_k(h) + E_k(h)))
time, where H is the hub set, and N_k(h) and E_k(h) are the number
of vertices and edges, respectively, in h's k-degree neighborhood.
We also note that this algorithm works correctly for both undirected
and directed graphs. Interestingly, we note the following property
of applying Algorithm 1 to an undirected graph.

Lemma 1. Let (u, v) be a basic hub pair in an undirected graph.
Suppose Algorithm 1 performs BFS from u first and discovers the
shortest path SP(u, v). When it later performs BFS from v and
discovers the symmetric basic pair (v, u), the algorithm will not
add any new vertices.

Proof Sketch: The score f guarantees f(v) = |SP(v, u)| = |SP(u, v)|,
and thus a shortest path as "good" as SP(u, v) will be extracted,
which does not need to add any new vertices to H*. □

This observation leads to a simple bound on the size of the
hub-network (the final size of H*), and the result of Algorithm 1
matches this bound.

Lemma 2. Let D_b^k ⊆ H × H be the set of all unique basic hub
pairs whose distance is no more than k. Then,

\[ |H^*| \le \sum_{(u,v) \in D_b^k} (d(u,v) - 1) + |H| \le \frac{|H| B}{2} (k - 1) + |H|, \]

where B is the average number of basic pairs per hub.

Proof Sketch: The term Σ_{(u,v) ∈ D_b^k} (d(u, v) − 1) corresponds to
the definition that any basic pair needs to recover only one
shortest path; this also corresponds to the worst case scenario in
Algorithm 1, where for any basic pair, all non-hub vertices along a
new shortest path need to be added to H*. Note that for an
undirected graph, D_b^k treats the basic pairs (u, v) and (v, u) as a
single one. This directly leads to the term (|H| B / 2)(k − 1), which
reflects that the maximal distance between any basic hub pair is k
and that only one shortest path needs to be recovered for the
symmetric basic pairs (u, v) and (v, u). Algorithm 1 also satisfies
this (Lemma 1). Note that the result holds for directed graphs as
well, where B is the total degree of both incoming and outgoing
edges. □

5. HUB-NETWORK BASED SEARCH 

In this section, we describe the hub-network based bidirectional
BFS. The main challenge is the following: given a hub-network, how
can we leverage it to maximally reduce the expansion of hubs and
still guarantee to discover the correct k-degree shortest path?
Recall that a key reason for introducing the hub-network is to use it
to constrain the expansion of hubs. Thus, a basic search principle
is that any hub will only visit its neighbors in the hub-network.
But what about the non-hub vertices v in the hub-network, i.e.,
v ∈ H* \ H? Should they be expanded only within the hub-network, or
should they be treated like the remaining vertices outside the
hub-network? Furthermore, in traditional bidirectional BFS, when the
two searches (forward and backward) meet for the first time, the
shortest path is discovered. Unfortunately, this does not
necessarily hold if hubs are not fully expanded, and thus the
question becomes: what should be the correct stop condition? The
stop condition is crucial as it determines the search space and the
correctness of discovering the exact shortest path.

In the following, we first describe the hub-network based
bidirectional BFS algorithm (Subsection 5.1), and then we prove its
correctness and discuss its search cost (Subsection 5.2).

5.1 HN-BBFS Algorithm 

The Hub-Network based Bidirectional BFS (HN-BBFS) algorithm
consists of a two-step process: 1) (Meeting step) A bidirectional
search traverses both the hub-network and the remaining graph until
the forward and backward searches meet at the first common vertex;
2) (Verification step) Next, the searches continue in the remaining
graph (not the hub-network) to verify whether the path discovered in
the first step is shortest. If not, this step will discover an
alternative shortest path.

Expansion Rule: In the Meeting step, the forward (backward) BFS
follows these rules to expand a vertex v in G: 1) if v is a hub,
then it only expands its neighbors in the hub-network; 2) if v is a
regular vertex (not in the hub-network), then it expands all its
neighbors; 3) if v is a non-hub vertex in the hub-network, i.e.,
v ∈ H* \ H, then it only expands its neighbors in the hub-network if
the BFS traversal first reaches it through a hub; otherwise, it is
treated as a regular vertex (no shortest path from the start (end)
vertex to it goes through a hub). In the Verification step, both
forward and backward BFS traversals continue, but they do not need
to expand any hub, and any regular vertex and any non-hub vertex in
the hub-network expands all its neighbors in the entire network.

Stop Condition: The stop condition for the forward (backward)
BFS in the Verification step is as follows. Let dist be the shortest
path distance discovered so far; let d_s^h (d_t^h) be the distance
between s (t) and its closest hub h; let level_f (level_b) be the
current level being traversed by the forward (backward) BFS. Then,
the forward (backward) BFS stops when the following condition is
met, i.e., when even the best possible extension through a hub can
no longer be shorter than dist:

\[ dist \le level_f + d_t^h + 1 \qquad (dist \le level_b + d_s^h + 1) \tag{1} \]

Overall Algorithm: Hub-Network based Bidirectional BFS (HN-BBFS)
is sketched in Algorithm 2. Note that BackwardSearch is essentially
the same as ForwardSearch and is omitted for simplicity. Initially,
dist is set to k + 1 for the k-degree shortest path search
(indicating no path within k hops) and the met condition is false
(Line 2).

The first step (Meeting Step) is carried out by the first while loop
(Lines 3-6), where a forward search and a backward search are
employed in an alternating manner. In ForwardSearch (and
BackwardSearch), a vertex in the corresponding queue Q_f (Q_b) is
expanded. The expansion rule described earlier is used in Line 15.
Basically, if a vertex is a hub, or it is in the hub-network
(H* \ H) and the BFS traversal first reaches it through a hub (there
is a shortest path from s to it via a hub), it is considered
"in-hub-network". Otherwise, it is "out-hub-network". For an
in-hub-network vertex, BFS only expands its neighbors in the
hub-network. Note that recognizing these "in-hub-network" vertices
is straightforward and can be incrementally computed (similar to
using the flag b in Algorithm 1). Once a forward (backward) search
visits a vertex already visited by the backward (forward) search, a
candidate shortest path is discovered and met is set to true. Note
that when Verification is false (in the first step), every vertex
(both hubs and non-hubs) will be visited and expanded.

Once met turns true, the second step (Verification Step) is carried
out by the second while loop (Lines 8-11). As long as the forward
stop condition is not met (stop_f is false), ForwardSearch
continues. However, only out-hub-network vertices are visited and
expanded (Line 13 and Lines 24-26). Also, during the expansion
process, the candidate shortest path can be updated (Lines 17-19).
Finally, when the stop condition is met (Line 27: d(s, v) is the
current BFS level being expanded, thus level_f), stop_f becomes true
and no further forward search is performed (Line 9). Note that d_s^h
(d_t^h) can be easily computed during the BFS traversal: the first
time a hub is visited by the forward (backward) search, its distance
to s (t) is recorded as d_s^h (d_t^h).

Algorithm 2 HN-BBFS(G, G[H*], s, t)

 1: Q_f ← {s}; Q_b ← {t} {queues for forward and backward search}
 2: dist ← k + 1; met ← false;
 3: while (Q_f ≠ ∅ AND Q_b ≠ ∅) AND NOT met AND
      d(s, Q_f.top) + d(Q_b.top, t) < dist do
 4:   ForwardSearch(Q_f, false); {not Verification Step}
 5:   BackwardSearch(Q_b, false);
 6: end while
 7: stop_f ← false; stop_b ← false;
 8: while NOT ((Q_f = ∅ OR stop_f) AND (Q_b = ∅ OR stop_b)) do
 9:   if NOT stop_f then ForwardSearch(Q_f, true); {true: Verification Step}
10:   if NOT stop_b then BackwardSearch(Q_b, true)
11: end while
12: return dist and shortest path;

Procedure ForwardSearch(Q_f, Verification)

13: u ← Q_f.pop() {if Verification is true, only out-hub-network
      vertices will be visited}
14: u is set to be visited by forward search;
15: for all v ∈ neighbor(u) {if u is a hub or there is a shortest path
      from s to u via a hub, neighbor(u) is within the hub-network} do
16:   if v is visited by backward search {searches meet} then
17:     if d(s, u) + d(v, t) + 1 < dist then
18:       update dist and the shortest path correspondingly;
19:       if NOT met {the first time meet} then
20:         met ← true
21:       end if
22:     end if
23:   end if
24:   if v is not visited AND NOT (Verification AND v ∈ H) then
25:     Q_f.push_back(v);
26:   end if
27:   if Verification AND dist ≤ d(s, v) + d_t^h + 1 then
28:     stop_f ← true;
29:   end if
30: end for

5.2 Correctness and Search Cost 

We now discuss the correctness of HN-BBFS (Algorithm 2) and
then its search cost (especially in terms of the new stop condition,
Formula 1). To prove the correctness of HN-BBFS, we make the
following important observations:

Lemma 3. For any hub h ∈ H, during the first step (Meeting
Step), the distance d(s, h) computed using the forward BFS search,
i.e., the number of traversal levels to reach h, is the exact
shortest path distance between s and h. The same holds for d(h, t)
for the backward BFS traversal.

Proof Sketch: If s is a hub, then based on the hub-network
definition, this clearly holds. If s is not a hub, then one of the
following two cases must hold: 1) No shortest path between (s, h)
contains a hub other than h, so the forward BFS finds the shortest
path distance d(s, h) by traversing only non-hub vertices in the
original graph; 2) There is a shortest path between (s, h)
containing another hub, so there is always a hub h' such that the
shortest path between (s, h') does not contain any other hub and the
distance of (h', h) can be recovered in the hub-network. □

Lemma 3 demonstrates the power of the hub-network and shows
that HN-BBFS can correctly calculate the shortest path (distance)
between query vertices and hubs (and between hubs). However, despite
this, the candidate shortest path discovered at the first meeting
vertex may not be the exact one. The following lemma categorizes the
exact shortest paths if they are shorter than the candidate shortest
path discovered in the first step (Meeting Step).

Lemma 4. Assume u is the meeting vertex where the forward and
backward searches first meet (Lines 16-23 in Algorithm 2); the
candidate shortest path is denoted as SP(s, u, t) and the distance
dist is d(s, u) + d(u, t). If there is a shorter path, then it must
contain a hub h, such that the exact shortest path can be
represented as two segments SP(s, h) and SP(h, t). Moreover, either
1) SP(s, h) contains no hub other than h, with distances
d(s, h) > d(s, u) and d(h, t) < d(u, t), or 2) SP(h, t) contains no
hub other than h, with distances d(s, h) < d(s, u) and
d(h, t) > d(u, t).

Proof Sketch: We prove this by contradiction. If the lemma does not
hold, then a path shorter than the discovered candidate shortest
path must be of one of the following two types: 1) there is no hub
in the exact shortest path SP(s, t), or 2) there are two hubs h_s
and h_t, such that the shortest path has three segments SP(s, h_s),
SP(h_s, h_t) and SP(h_t, t), where d(s, h_s) < d(s, u) and
d(h_t, t) < d(u, t). For the first case, the bidirectional BFS would
have found such a path (if it is shorter than the candidate
SP(s, u, t)) earlier, as it only involves visiting non-hub vertices
in the graph. For the second case, based on Lemma 3, Algorithm 2
computes the exact d(s, h_s) and d(h_t, t) before the two BFS
traversals meet at u, and the hub-network encodes the correct
distance d(h_s, h_t). Thus, if d(s, h_s) + d(h_s, h_t) + d(h_t, t) <
d(s, u) + d(u, t), this shortest path would have been discovered
(with the searches meeting at an in-hub-network vertex) during the
first step (Meeting Step). Since both cases are impossible, the
lemma holds. □

Theorem 3. The Hub-Network based Bidirectional BFS approach
(HN-BBFS, Algorithm 2) guarantees the discovery of the exact
k-degree shortest path.

Proof Sketch: Basically, we need to show that when the stop
condition is met, no shorter alternative path exists. By Lemma 4, if
a shortest path exists that is better than the candidate shortest
path SP(s, u, t), it must follow one of two simple formats. These
formats suggest that we only need to extend out-hub-network vertices
until they meet a hub already visited from the other direction
(d(s, h_s) < d(s, u) or d(h_t, t) < d(u, t)). If such a path can be
found, it must be shorter than the already discovered distance dist,
i.e., dist > level_f + 1 + d_t^h (the best case is when the shortest
path extends from the current level by one step to a hub closest to
the query vertex). Clearly, if this does not hold, no path in this
format can be shorter than dist. □

In classical bidirectional search, once both directions meet at a
common vertex, the search can be stopped and the exact shortest
path is discovered. However, in HN-BBFS, in order to reduce the
expansion of hubs, some additional traversal (Verification Step) has
to be performed. Clearly, if we need to walk k/2 additional steps,
then the benefit of HN-BBFS can be greatly compromised.



So, what is the average number of steps HN-BBFS needs to take
for a typical (random) query in the Verification step? The number
is close to zero or at most one. To illustrate, first consider the
distance between a vertex pair to be 6 (since most distances are
less than that in social networks [39]), and assume s and t are not
hubs (because there are few hubs) but each of them has a direct
hub-neighbor, i.e., d_s^h = 1 (d_t^h = 1). Note that both directions
typically traverse at most three steps, i.e., level_f = level_b = 3.
Thus, at most one extra step needs to be taken in this case to make
the stop condition true: dist − level_f − d_t^h − 1 ≤ 0, where
level_f = 4. Similarly, let us consider the distance to be 4 and
assume each direction has taken 2 steps in the Meeting Step. In this
case, there is no need to take an additional step (assuming s and t
are not hubs), and we can immediately recognize that the candidate
shortest path is indeed the exact one. Finally, we note that when
dist − level_f − d_t^h − 1 = 1, i.e., at the last step of BFS for
Verification, there is no need to expand all the neighbors of a
given vertex. Only its immediate hub-neighbors need to be expanded
and checked (Lemma 4 and Theorem 3). To facilitate this, the
neighbors of regular vertices can be reorganized so that the
hub-neighbors and non-hub-neighbors are separately recorded.
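The two cases above can be checked with a few lines of arithmetic. The sketch below assumes the reading of Formula 1 under which the forward search stops as soon as dist ≤ level_f + d_t^h + 1, which is the reading consistent with both worked cases; the function name `extra_verification_levels` is hypothetical.

```python
def extra_verification_levels(dist, meet_level, d_hub):
    """Count the additional BFS levels one direction traverses in the
    Verification step before the stop condition holds.

    dist       -- candidate distance found in the Meeting step
    meet_level -- level reached by this direction when the searches met
    d_hub      -- distance from the opposite query vertex to its closest hub
    """
    level, extra = meet_level, 0
    # Keep going while a path through a hub could still beat dist.
    while dist > level + d_hub + 1:
        level += 1
        extra += 1
    return extra

six_hop = extra_verification_levels(dist=6, meet_level=3, d_hub=1)
four_hop = extra_verification_levels(dist=4, meet_level=2, d_hub=1)
```

With a candidate of length 6 found after three levels per direction, one extra level is needed (the condition first holds at level 4); with a candidate of length 4 found after two levels, no extra level is needed.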

6. HUB²-LABELING FOR SHORTEST PATH COMPUTATION

In this section, we present a Hub²-Labeling approach which aims
to completely avoid visiting (and expanding) any hub. To achieve
this, more expensive, though often affordable, pre-computation and
memory costs are traded for faster online query processing. In
Subsection 6.1, we describe the Hub²-Labeling framework and its
index construction. In Subsection 6.2, we discuss the faster
bidirectional BFS.

6.1 Hub²-Labeling Framework

Hub²-Labeling replaces the Hub-Network with a Hub² distance
matrix and Hub Labeling.

Hub²: The distance matrix between hub pairs (referred to as Hub²)
is precomputed and stored in main memory. Indeed, only the
distances of pairs with distance no more than k need to be computed
for the k-degree shortest path. As we discussed before, nowadays a
desktop computer with moderate memory size can easily hold such
a matrix for 10K (or more) hubs.

Hub Labeling: In order to effectively utilize the distance matrix,
each vertex v in the graph also records a small portion of the hubs,
referred to as its core-hubs, along with the distances to them.
Basically, these core-hubs together with the distance matrix help
quickly estimate an upper bound on the distance between the query
vertex pair, which can be used for bounding the search steps of the
bidirectional BFS. Now, we formally define the core-hubs.

Definition 3. (Core-Hubs) Given graph G = (V, E) and
a collection H of hubs, for each vertex v, we say hub h ∈ H
is a core-hub for v if there is no other hub h' ∈ H such that
d(v, h) = d(v, h') + d(h', h). Formally, L(v) = {h ∈ H : ∄ h' ∈
H \ {h}, d(v, h) = d(v, h') + d(h', h)}.

Simply speaking, if no other hub h' appears in any shortest
path between v and h, then h is v's core-hub. Note that a pair
(v, h), where h ∈ L(v), is similar to a basic pair in the hub-network
(Subsection 4.2). The original basic pair definition only refers to
hub pairs, but here it is extended to vertex pairs with one hub and
one non-hub vertex.
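Definition 3 can be turned into a brute-force labeling routine, shown below as a sketch only (the paper instead derives the labels from the flag b during the BFS of Algorithm 1). It assumes an undirected adjacency dict and, as a convention introduced here rather than taken from the paper, gives every hub the trivial label {itself: 0}. The names `bfs_within` and `core_hubs` and the toy graph are hypothetical.

```python
from collections import deque

def bfs_within(adj, src, k):
    """BFS distances from src, limited to k hops."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def core_hubs(adj, hubs, k):
    """Return {v: {core hub h: d(v, h)}} for hubs within k hops of v."""
    from_hub = {h: bfs_within(adj, h, k) for h in hubs}
    labels = {}
    for v in adj:
        if v in hubs:
            labels[v] = {v: 0}  # convention assumed for this sketch
            continue
        labels[v] = {}
        for h in hubs:
            d_vh = from_hub[h].get(v)
            if d_vh is None:
                continue
            # h is not a core-hub if another hub lies on a shortest v-h path.
            via_other = any(
                other != h
                and other in from_hub[h]
                and v in from_hub[other]
                and from_hub[other][v] + from_hub[h][other] == d_vh
                for other in hubs
            )
            if not via_other:
                labels[v][h] = d_vh
    return labels

# Toy graph: 1 - 4 - 8 - 3, where 4 and 8 are hubs.
adj = {1: {4}, 4: {1, 8}, 8: {4, 3}, 3: {8}}
labels = core_hubs(adj, {4, 8}, k=3)
```

Vertex 1 records only hub 4, because hub 8 is reached through hub 4; symmetrically, vertex 3 records only hub 8.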

Example 6.1. Figure 1(c) illustrates the core-hubs (along with
the distances) for each non-hub vertex in the original graph
(Figure 1(a)). Here the hubs are 4, 6, 8, 12, 17, 18, and 19. For
instance, Vertex 1 only needs to record core-hubs 4, 6, 12 and 19,
and it can reach hubs 8 and 17 through them along some shortest path.

Using the core-hubs L and the distance matrix Hub², we can
approximate the distance and the shortest path for a vertex pair
(s, t) in the following fashion:

\[ d_H(s, t) = \min_{x \in L(s),\, y \in L(t)} \{ d(s, x) + d(x, y) + d(y, t) \} \tag{2} \]

Here, d(x, y) is the exact distance recorded in the distance matrix
Hub².
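Formula 2 amounts to a pairwise join of the two core-hub labels. A minimal sketch follows, assuming each label is a dict from hub to distance and the Hub² matrix is a dict keyed by ordered hub pairs; the function name `estimate_distance` and the sample values are hypothetical.

```python
import math

def estimate_distance(label_s, label_t, hub_dist):
    """Upper bound d_H(s, t) from the core-hub labels and the hub matrix."""
    best = math.inf
    for x, d_sx in label_s.items():
        for y, d_yt in label_t.items():
            # The same hub on both sides contributes zero hub-to-hub distance.
            d_xy = 0 if x == y else hub_dist.get((x, y), math.inf)
            best = min(best, d_sx + d_xy + d_yt)
    return best

hub_dist = {(4, 8): 1, (8, 4): 1}
estimate = estimate_distance({4: 1}, {8: 1}, hub_dist)
shared = estimate_distance({4: 1}, {4: 2}, hub_dist)
```

The result is infinite when no pair of core-hubs is connected within the precomputed matrix, which signals that no path through a hub exists within k hops.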

The construction of the distance matrix Hub² and the labeling of
core-hubs are also rather straightforward. The BFS procedure in
Algorithm 1 can be easily adapted: 1) each BFS performs k steps,
and thus the distance matrix can be directly constructed; 2) when
a vertex v has flag b = 1 (basic pair) in the BFS traversal from h,
we simply append h to L(v). Thus, the total computational complexity
of the pre-computation is O(Σ_{h∈H} (N_k(h) + E_k(h))) time, where
H is the hub set, and N_k(h) and E_k(h) are the number of vertices
and edges, respectively, in h's k-degree neighborhood. We note that
for directed graphs, we compute both L_in(v) and L_out(v), one
for incoming core-hubs (h, v) and the other for outgoing core-hubs
(v, h). To construct such labels, we need to perform both forward
and backward BFS from each hub.

The overall memory cost of Hub²-Labeling is the sum of the cost
of the distance matrix (Hub²) and the core-hub labeling for each
vertex (L(v)): Σ_{v∈V} O(|L(v)|) + O(|H|²). This turns out to be
rather affordable. In the experimental study, we found that for
most of the real social networks, the core-hubs of each vertex v
are only a small portion of the total hubs (in most cases, less than
or close to 2%). Thus, Hub²-Labeling can easily handle graphs with
more than 10K hubs. Furthermore, since the second term (the size of
the distance matrix) is stable as the number of vertices in the
original graph increases, the first term will scale linearly with
respect to |V|.

6.2 Hub²-Labeling Query Processing

To compute the k-degree shortest path between vertex pair (s, t),
the online query process in Hub²-Labeling consists of two steps:
Step 1 (Distance Estimation): Using the distance matrix Hub²
and the core-hub labelings L(s) and L(t), the distance d_H(s, t) is
estimated (Formula 2).

Step 2 (Hub-Pruning Bidirectional BFS (HP-BBFS)): A bidirectional
BFS from s and t is performed, and the number of search steps is
bounded by the minimum of k (for the k-degree shortest path) and
d_H(s, t). In particular, none of the hubs need to be expanded
during the bidirectional search. Mathematically, the Hub-Pruning
Bidirectional BFS is equivalent to performing a typical
bidirectional BFS on the non-hub induced subgraph G[V \ H] of G.
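The two steps can be combined as in the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the graph is an undirected adjacency dict, the estimate d_H(s, t) from Step 1 is passed in as a number (infinity if unavailable), each iteration expands the smaller frontier by one full level, and only the distance is returned. The names `hub_pruned_distance` and `k_degree_distance` and the toy graph are hypothetical.

```python
import math

def hub_pruned_distance(adj, hubs, s, t, bound):
    """Bidirectional BFS on G[V \\ H]; returns d(s, t) there if it is <= bound."""
    if s in hubs or t in hubs:
        return None
    if s == t:
        return 0
    dist_f, dist_b = {s: 0}, {t: 0}
    front_f, front_b = [s], [t]
    depth_f = depth_b = 0
    while front_f and front_b and depth_f + depth_b < bound:
        forward = len(front_f) <= len(front_b)
        if forward:
            depth_f += 1
            front, dist, other, depth = front_f, dist_f, dist_b, depth_f
        else:
            depth_b += 1
            front, dist, other, depth = front_b, dist_b, dist_f, depth_b
        nxt, best = [], None
        for u in front:
            for v in adj[u]:
                if v in hubs or v in dist:
                    continue  # hubs are never visited or expanded
                dist[v] = depth
                nxt.append(v)
                if v in other:  # the two searches meet at v
                    cand = depth + other[v]
                    best = cand if best is None else min(best, cand)
        if best is not None:
            return best
        if forward:
            front_f = nxt
        else:
            front_b = nxt
    return None

def k_degree_distance(adj, hubs, s, t, k, estimate):
    """Step 2 given the Step 1 estimate; None means no path within k hops."""
    found = hub_pruned_distance(adj, hubs, s, t, min(k, estimate))
    if found is not None:
        return found  # never larger than the estimate, by the search bound
    return estimate if estimate <= k else None

# Toy graph: hub 4 links 1 and 2; a hub-free path 1 - 5 - 6 - 2 also exists.
adj = {1: {4, 5}, 4: {1, 2}, 2: {4, 6}, 5: {1, 6}, 6: {5, 2}}
via_hub = k_degree_distance(adj, {4}, 1, 2, k=3, estimate=2)
hub_free = k_degree_distance(adj, {4}, 1, 6, k=3, estimate=3)
```

For the pair (1, 2) the hub-free search is cut off at the estimate and the answer 2 comes from Step 1; for the pair (1, 6) the hub-free search finds the shorter path of length 2.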

Theorem 4. The two-step Hub²-Labeling query process can
correctly compute the k-degree shortest path in graph G.

Proof Sketch: We observe that any vertex pair with distance no
more than k can be categorized as: 1) vertex pairs having at least
one shortest path passing through at least one hub in H; and 2)
vertex pairs whose shortest paths never pass through any hub.

For any vertex pair (s, t) with distance no greater than k
(d(s, t) ≤ k), if there exists a hub x' ∈ H satisfying d(s, t) =
d(s, x') + d(x', t), then we can always find x ∈ L(s) and y ∈ L(t)
such that d(s, t) = d(s, x) + d(x, y) + d(y, t). In other words,
Step 1 (distance estimation), which uses the distance matrix Hub²
and the core-hub labeling, can handle this category. Also, Step 2
helps confirm that the shortest path belongs to this category (it
cannot find a shorter one).

If the approximate shortest path computed in Step 1 is not an
exact one, then the shortest path does not involve any hub. Thus,
Step 2 is guaranteed to extract an exact shortest path using the
bidirectional search in the non-hub induced subgraph G[V \ H]. □

The time complexity of online query processing for a pair s and t
can be written as

\[ O\big(|L(s)||L(t)| + N_{k/2}(s \mid G[V \setminus H]) + E_{k/2}(s \mid G[V \setminus H]) + N'_{k/2}(t \mid G[V \setminus H]) + E'_{k/2}(t \mid G[V \setminus H])\big), \]

where |L(s)||L(t)| is the distance estimation cost and the remaining
terms are the cost of the bidirectional search. N_{k/2} (N'_{k/2})
and E_{k/2} (E'_{k/2}) are the numbers of vertices and edges in the
k/2-neighborhood (reversed neighborhood, which follows the incoming
edges) of the non-hub induced subgraph G[V \ H]. Since the hubs are
excluded, the cost of hub-pruning bidirectional BFS is significantly
smaller than that on the original graph.

However, if the number of core-hubs is large, then the distance
estimation can be expensive (a pairwise join on L(s) and L(t) is
performed). To address this issue, the core-hubs in L(u) can be
organized in a level-wise fashion, each level corresponding to their
distance to u, such as L^1(u), L^2(u), ..., L^k(u). Using such a
level-wise organization, we can perform a much more efficient
distance estimation: the pairwise join is first performed between
L^1(s) and L^1(t); then on (L^1(s), L^2(t)), (L^2(s), L^1(t)),
(L^2(s), L^2(t)), etc. Given this, let d denote the shortest path
length obtained by the pairwise joins so far. Assuming we are
currently working on (L^p(s), L^q(t)), if d < p + q, then we
terminate the pairwise join immediately. This is because it is
impossible for any later (L^p'(s), L^q'(t)) to produce better
results, since p' + q' ≥ p + q > d. This early termination strategy
based on the level-wise organization can help us effectively prune
unnecessary pairwise join operations and improve the query
efficiency.
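A sketch of the level-wise join with early termination is given below. It visits level pairs (p, q) in non-decreasing order of p + q, which is the ordering the termination argument relies on. The data layout (a dict from level to a list of hubs), the function name `levelwise_estimate`, and the sample values are assumptions made for illustration; level 0 is allowed so that a query vertex that is itself a hub can be represented.

```python
import math

def levelwise_estimate(levels_s, levels_t, hub_dist, k):
    """Distance estimate d_H(s, t) using level-wise core-hub labels.

    levels_s[p] lists the core-hubs of s at distance p (likewise levels_t).
    hub_dist maps ordered hub pairs to their precomputed distance.
    """
    best = math.inf
    for total in range(2 * k + 1):      # total = p + q
        if best < total:                # no later pair can do better
            break
        for p in range(total + 1):
            q = total - p
            for x in levels_s.get(p, ()):
                for y in levels_t.get(q, ()):
                    d_xy = 0 if x == y else hub_dist.get((x, y), math.inf)
                    best = min(best, p + d_xy + q)
    return best

# Hypothetical labels: s sees hub 4 at distance 1 and hub 12 at distance 2;
# t sees hub 8 at distance 1.
levels_s = {1: [4], 2: [12]}
levels_t = {1: [8]}
hub_dist = {(4, 8): 1, (12, 8): 3}
levelwise = levelwise_estimate(levels_s, levels_t, hub_dist, k=3)
```

Here the join over (L^1(s), L^1(t)) already yields 3; the pair (L^2(s), L^1(t)) with p + q = 3 is still examined, and the loop stops before any pair with p + q = 4.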

7. EXPERIMENTAL EVALUATION 

In this section, we empirically evaluate the performance of our
algorithms on a range of large real social networks. In particular,
we compare the Hub-Network approach (denoted as HN) and the
Hub²-Labeling approach (denoted as HL) with the following methods:
1) basic breadth-first search (denoted as BFS); 2) bidirectional
breadth-first search (denoted as BiBFS); 3) the Sketch algorithm [9]
(denoted as S*), the state-of-the-art approximate distance
estimation algorithm; 4) the TreeSketch method [16] (denoted as
TS*), which utilizes a tree to improve the approximation accuracy
of Sketch-based shortest path computation. Here the symbol * also
indicates that it is an approximation method.

In addition, we have also tested the two latest exact shortest path
distance methods, namely tree decomposition based shortest path
computation [40] and the highway-centric labeling approach [19], based on
the implementations provided by their authors. However, neither of them
can work on the graphs used in this study. This is as expected, as their
indexing cost is very high (tree decomposition or set-cover approach) and
they mainly focus on very sparse graphs.

We also tested RigelPath, another recent approach for approximate
shortest path discovery in social networks [42]. However, its query
performance is slower than that of Sketch (also confirmed in their own
study [42]). Furthermore, its current implementation only focuses on
undirected graphs, whereas most of the real benchmarking networks are
directed. Thus, we do not report RigelPath's experimental results here.

We implemented our algorithms in C++ with the Standard Template Library
(STL). The implementation of the sketch-based approaches (including S*
and TS*) was kindly provided by the authors of [16] (also implemented in
C++). All experiments were run on a Linux server with 2.48GHz AMD Opteron
processors and 32GB RAM.

In the experiments, we are interested in two important measures: query
time and preprocessing cost, which consists of precomputation time and
index size. To measure the query time, we randomly generate 10,000 vertex
pairs and obtain the average running time per query. For the index size,
since all Sketch indices are stored in RDF format, their index sizes are
measured in terms of the corresponding RDF file size. If the
preprocessing cannot be finished in 48 hours, we stop it and record "-"
in the table of results. Furthermore, we note that all Sketch-based
benchmarks can only approximate shortest paths, where the approximation
accuracy is influenced by an iterative sampling procedure. A parameter
$r$ specifies the number of sampling iterations, which leads to
$2r \log |V|$ sketches for each vertex. To make a fair comparison with
exact query schemes, we set $r = 2$ as suggested in [16], which produces
sketches with good approximation accuracy and efficient query processing.
Also, in this study, we focus on comparing their query time against the
new approaches, even though they provide only approximate solutions
whereas our approaches provide the exact solution.
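The query-time measurement described above can be approximated by a small harness such as the one below. This is a hedged sketch rather than the benchmark code used in the study: the function name `average_query_time`, the `query` callable, and the fixed random seed are assumptions introduced for illustration only.

```python
import random
import time

def average_query_time(query, vertices, num_pairs=10_000, seed=0):
    """Average wall-clock seconds per query over random vertex pairs.

    query: any callable taking (s, t); stands in for BFS, BiBFS, HN or HL.
    vertices: sequence of vertex identifiers to sample endpoints from.
    """
    rng = random.Random(seed)    # fixed seed so every method sees the same pairs
    pairs = [(rng.choice(vertices), rng.choice(vertices))
             for _ in range(num_pairs)]
    start = time.perf_counter()
    for s, t in pairs:
        query(s, t)
    elapsed = time.perf_counter() - start
    return elapsed / num_pairs
```

Generating the pairs before starting the timer keeps the sampling cost out of the measurement, and reusing the same seed across methods means each method answers an identical workload.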

The benchmarking datasets are listed in Table 1. Most of them are
gathered from online social networks, with the number of vertices ranging
from several tens of thousands to more than 10 million. Others also
exhibit certain properties commonly observed in social networks, such as
small diameter and relatively high average vertex degree. All datasets
are downloadable from the Stanford Large Network Dataset Collection¹, the
Max Planck Institute's Online Social Network Research Center², and the
Social Computing Data Repository at Arizona State University³.

In Table 1, we present important characteristics of all real datasets,
where $\delta$ is the average vertex degree (i.e., $2|E|/|V|$) and
$d_{0.9}$ is the 90-percentile effective diameter [24]. Finally, in the
experimental study, we focus on the 6-degree shortest path queries
($k = 6$), as they are the most commonly used and also the most
challenging ones.
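As a quick worked check of the average-degree definition $2|E|/|V|$, the following Python sketch recomputes the value for two rows of Table 1. The helper name `average_degree` is hypothetical; the vertex and edge counts are the ones reported in the table.

```python
def average_degree(num_vertices, num_edges):
    """Average vertex degree, defined as 2|E| / |V|."""
    return 2 * num_edges / num_vertices

# Counts taken from Table 1.
facebook = average_degree(63731, 1545686)
slashdot = average_degree(82168, 948464)
print(round(facebook, 2), round(slashdot, 2))
```

Rounded to two decimals, both values agree with the average degrees listed for Facebook and Slashdot in Table 1.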

7.1 Experimental Results 

In the following, we report the effectiveness and efficiency of the
shortest path computation algorithms from different perspectives.

Query Results on Random Queries: In this experiment, we randomly generate
10,000 vertex pairs with various distances and execute all algorithms on
these queries to study their performance. Here, we select the 10,000
vertices with the highest vertex degree as hubs. Table 3 presents the
average query time for the 10,000 queries on all the methods, and Table 4
highlights the average query time for those vertex pairs whose distance
is no less than 4 (longer paths), as these are the more challenging ones
(the longer the path, the more hubs are likely to be expanded). Note that
for BFS and the two sketch methods, Sketch (S*) and TreeSketch (TS*), we
use the millisecond ($10^{-3}$ second) as the unit, as they typically
have much longer query time, and for BiBFS and our new approaches,
Hub-Network (HN) and Hub²-Labeling (HL), we use the microsecond
($10^{-6}$ second) as the unit, as they are much faster. Their
corresponding average search space per query is reported in Table 5,
where column "HP-BBFS" records the average number of vertices visited by
HP-BBFS (Hub-Pruning Bidirectional BFS) in Hub²-Labeling (HL) and column
"Join" records the average number of pairwise joins on the core-hub
labelings $L(s)$ and $L(t)$ in HL. We make the following observations on
the query time and average search space:
1) Hub²-Labeling (HL) is clearly the winner among all algorithms; it is
on average more than 2000 times faster than BFS. In most of the social
networks, like As-skitter and WikiTalk, the average query time of
Hub²-Labeling (HL) is only tens of microseconds ($10^{-6}$ second), and
except for one (Orkut), all of them are less than 1ms. Overall,
Hub²-Labeling (HL) is on average 23 times faster than BiBFS.
Specifically, we observe that compared to BiBFS, the Hub-Pruning
Bidirectional Search (HP-BBFS) of HL achieves a significant improvement
in terms of search space, which is around 800 times smaller than that of
BiBFS (Table 5).

2) The Hub-Network (HN) is on average about 2 times faster than BiBFS
(with no additional storage cost, but it reorganizes the network
structure). It is about two orders of magnitude faster than BFS but is
about 10 times slower than the Hub²-Labeling approach.

3) Sketch (S*) is on average about 10 times faster than BFS, but it fails
to run on a few datasets. TreeSketch (TS*) is on average 70 times slower
than Sketch. Both the Hub-Network and Hub²-Labeling approaches are on
average more than two orders of magnitude faster than Sketch, the fastest
approximation method.

4) For long-distance queries ($d(u, v) \geq 4$), the exact shortest path
approaches require longer query time (Table 4). However, the increase for
Hub-Network (HN) and Hub²-Labeling (HL) is smaller than that for BFS and
BiBFS. Also, it is interesting to observe that the approximate shortest
path approaches do not show a performance decrease, though both of them
are still very slow.

¹ http://snap.stanford.edu/data/index.html
² http://socialnetworks.mpi-sws.org/
³ http://socialcomputing.asu.edu/datasets/

Preprocessing Cost: Table 6 shows the preprocessing cost of the
Sketch-based approach along with HL, consisting of index size and
precomputation time. The first column, S*, records the index size (MB)
for the Sketch method. The second column, $HL_{total}$, records the total
index size of Hub²-Labeling (HL), which is the sum of the core-hub
labeling cost and the distance matrix size. The third column records the
average number of core-hubs stored by each vertex. Remarkably, the
core-hub labeling scheme in Hub²-Labeling (HL) is very effective, as only
a very small portion of the core-hubs is recorded by each vertex. In most
of the networks, the average number of core-hubs per vertex is no more
than 2% of the total hubs. In particular, for the WikiTalk network, only
2.5 core-hubs are stored in each vertex on average, which potentially
leads to efficient query answering. However, for LiveJournal, the
Hub²-Labeling is too expensive to be materialized in main memory. In
terms of precomputation time, Hub²-Labeling can be constructed faster
than Sketch on 7 out of 10 networks. The construction of Hub-Network (HN)
is on average more than three times faster than that of Hub²-Labeling
(HL), and it does not need any additional memory cost.

Impacts of Hub Number: In this experiment, we study the effect of
different numbers of hubs on query performance. Here, we vary the hub-set
size from 5,000 to 15,000 and conduct the experiment on 10,000 randomly
generated queries with various distances. Table 2 shows the average query
time of the Hub-Network (HN) and Hub²-Labeling (HL) approaches using
different numbers of hubs. In most of these networks, the best query
performance is achieved when the number of hubs lies between 10K and 15K.
Though a large number of hubs may potentially help reduce the search
space of the bidirectional search in Hub²-Labeling (HL), it may also
increase the size of the core-hubs associated with each vertex. We
observe that the query performance obtained by using 10K hubs is
comparable to the best one. Note that here, due to space limitations, we
do not report the detailed precomputation cost in terms of construction
time and index size (for Hub²-Labeling). Overall, as the number of hubs
increases, most large networks show an increasing trend regarding the
average index size. Interestingly, when the hub-set size increases, a
significant reduction of the average index size is observed on WikiTalk.
This is in part explained by its very small diameter. In terms of the
precomputation time, as more hubs are chosen, the computational cost of
Hub-Network and Hub²-Labeling becomes larger, because more BFS traversals
need to be performed. Indeed, the precomputation time increases almost
linearly with respect to the hub-set size.
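Since the hub set in these experiments is chosen as the highest-degree vertices, varying the hub-set size amounts to changing one parameter of a top-degree selection. A minimal Python sketch is given below; the function names and the choice to count both endpoints of every edge (in-degree plus out-degree for directed graphs) are assumptions made for illustration, as the exact degree definition is not restated in this section.

```python
import heapq
from collections import Counter

def degrees_from_edges(edges):
    """Count, for every vertex, the number of edges it touches (assumed degree)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree

def select_hubs(degree, num_hubs):
    """Return the num_hubs vertices with the largest degree as a set."""
    return set(heapq.nlargest(num_hubs, degree, key=degree.get))
```

For example, `select_hubs(degrees_from_edges(edges), 10_000)` would correspond to the default setting of 10,000 hubs, and passing 5,000, 8,000 or 15,000 would correspond to the other hub-set sizes studied here.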



Dataset              |V|         |E|        δ   d_0.9
Facebook           63731     1545686    48.51     8.2
Slashdot           82168      948464    23.09     4.7
BerkStan          685230     7600595    22.18      10
Youtube          1138499     4945382     8.69    7.14
As-skitter       1696415    11095298    13.08     5.9
Flickr           1715255    22613981    26.37    7.32
Flickr-growth    2302925    33140018    28.78    7.19
Wiki-talk        2394385     5021410     4.19       4
Orkut            3072441   223534301   145.51     5.7
LiveJournal      5204176    77402652    29.75    8.34
Twitter         11316811    85331845    15.08   24.97

Table 1: Network Statistics

                 |H| = 5000      |H| = 8000      |H| = 10000     |H| = 15000
Dataset             HN      HL      HN      HL      HN      HL      HN      HL
Facebook         0.043   0.018   0.044   0.017   0.042   0.017   0.040   0.019
Slashdot         0.023   0.002   0.021   0.001   0.022   0.001   0.022   0.002
BerkStan         0.011   0.005   0.005   0.009   0.010   0.004   0.014   0.002
Youtube          0.106   0.006   0.119   0.005   0.125   0.005   0.136   0.005
As-skitter       0.051   0.016   0.044   0.015   0.040   0.013   0.041   0.011
Flickr           1.600   0.112   1.671   0.073   1.739   0.067   1.888   0.061
Flickr-growth    0.998   0.138   1.130   0.113   1.193   0.100   1.236   0.136
Wiki-talk        0.014   0.002   0.016   0.002   0.014   0.002   0.014   0.001
Orkut            0.952   3.653   0.955   3.314   0.978   3.356   1.078   3.282
LiveJournal      0.466       -   0.526       -   0.513       -   0.577       -
Twitter          1.850   0.306   1.947   0.314   2.083   0.340   2.121       -

Table 2: Average Query Time with Different Hub Sizes (ms)



Dataset          BFS (ms)    S* (ms)   TS* (ms) BiBFS (µs)    HN (µs)    HL (µs)
Facebook              1.7        0.5       20.4       55.2       41.9       17.4
Slashdot              1.4        0.7       34.5       31.6       22.2        1.3
BerkStan              0.3        4.7      559.1       33.9       10.2        3.5
Youtube              15.3          2      171.2      312.2      125.1        5.4
As-skitter            4.9        1.5      114.9       86.7       40.4       12.7
Flickr               42.6        2.7      288.7     2887.9     1738.8       67.3
Flickr-growth        71.8        5.1        305     1607.4     1193.3      100.3
Wiki-talk            18.8          -          -       56.4       14.1        1.5
Orkut               202.5        7.8      258.5     1338.7      978.1     3356.4
LiveJournal         131.4          -          -      749.6      513.1          -
Twitter             221.4          -          -     2311.8     2082.6      339.7

Table 3: Average Query Time on Random Query

Dataset          BFS (ms)    S* (ms)   TS* (ms) BiBFS (µs)    HN (µs)    HL (µs)
Facebook              1.9        0.5       19.6       61.2       45.7       19.9
Slashdot              1.7        0.7       46.8       31.4       20.3        1.5
BerkStan              0.3        2.1      206.7       36.1       10.6        3.8
Youtube                16        1.2         95      325.8      130.7        5.6
As-skitter            5.4        1.2       84.2       94.7       46.3         14
Flickr               45.2        2.9      182.1       3060     1825.2       79.1
Flickr-growth        71.9        3.7      332.5     1616.6     1219.6      103.6
Wiki-talk            21.7          -          -       58.3       14.2        1.1
Orkut               225.8        3.4        268     1372.9     1111.1     4639.5
LiveJournal         127.7          -          -      699.3        524          -
Twitter             250.4          -          -     2384.3     2190.1      254.5

Table 4: Average Query Time on Random Query with Distance ≥ 4



Dataset              BFS     BiBFS        HN  HL: HP-BBFS  HL: Join
Facebook           30589      1723      1867          208       466
Slashdot           41030      1380      1358            3        20
BerkStan           11099      1462       405           78        39
Youtube           505842     13941      6303           78        90
As-skitter        161878      3580      1551          292       265
Flickr            580315     36161     15494         1431      1330
Flickr-growth     777994     23738     12412         2382      1431
Wiki-talk        1178526      4255      1111            1         7
Orkut            1522640     29341     21954        71331      5367
LiveJournal      1784211     14172     15554            -         -
Twitter          3275797     55558     54884        13866     10757

Table 5: Average Search Space on Random Query

Dataset         S* (MB)  HL_total (MB)  Avg. core-hubs  S* (min)  HN (min)  HL (min)
Facebook             10            955             8.2       3.2       2.2       3.8
Slashdot             26            496            11.1       6.5       1.3       4.3
BerkStan            193            291            21.6      64.3       0.3       1.7
Youtube             217            757            38.9     100.8      15.5        66
As-skitter          391           1229           101.9     109.9       7.1      31.7
Flickr              626           1536             232     163.8      43.4     202.5
Flickr-growth      1004         4403.2           315.9     242.8      71.8     363.5
Wiki-talk             -            481             2.5         -      12.5      41.2
Orkut              8397          13517           749.3     773.2     412.5    1431.6
LiveJournal           -              -               -         -     334.2         -
Twitter               -          26931             464         -     233.9     390.2

Table 6: Preprocessing Cost on Random Query (index size in MB, average
number of core-hubs per vertex, and preprocessing time in minutes)



                     |H| = 5000              |H| = 8000             |H| = 10000             |H| = 15000
Dataset           |H*|   d1(H)   d2(H)    |H*|   d1(H)   d2(H)    |H*|   d1(H)   d2(H)    |H*|   d1(H)   d2(H)
Facebook         20854   247.7   217.1   27364   202.7   184.5   30554   182.2   168.1   36188   146.6   137.5
Slashdot         23359   204.5   179.5   27581   150.1   135.6   29500   128.4   117.2   32665    95.2    88.0
BerkStan          8290   769.3   177.8   16563   574.3   152.8   24618   492.8   138.1   34342   364.6   110.3
Youtube          49516   587.5   299.9   69474   429.9   254.9   76894   369.4   231.1  100595   279.2   189.3
As-skitter       41371   958.9   211.0   56245   701.3   184.8   64785   601.4   171.3   82439   453.0   146.3
Flickr           19198  2539.3  1433.3   32972  2005.8  1364.7   42312  1776.7  1295.0   63774  1403.7  1128.9
Flickr-growth    22715  3175.3  1626.7   38819  2555.4  1615.5   49450  2284.0  1565.4   74569  1833.5  1407.7
Wiki-talk        24139   984.5   294.7   32435   669.2   220.3   36081   552.4   188.8   41567   385.9   139.9
Orkut           124607  3808.5  1720.9  189686  3022.9  1763.0  225678  2734.3  1763.4  319989  2305.0  1720.0
LiveJournal     151348  1172.3   702.1  229836  1004.5   673.4  278203   932.8   653.7  392423   808.8   611.0
Twitter         201521  9556.6  2877.8  346091  6762.9  2641.2  424853  5749.2  2463.3  564435  4267.5  2084.0

Table 7: Hub-Network Statistics



Hub-Network Statistics: Finally, we report the basic statistics of the
discovered distance-preserving Hub-Network. Specifically, we are
interested in the following two questions: 1) given a set of hubs, how
large will the hub-network be, i.e., what is the size of $|H^*|$? 2) what
is the degree difference between the hubs in the original network and in
the Hub-Network; do we observe a significant degree decrease? To answer
these two questions, in Table 7 we report $|H^*|$ (the total number of
vertices in the hub-network), $d_1(H)$, the average degree of hubs in the
original graph, and $d_2(H)$, the average degree of hubs in the extracted
hub-network, with respect to 5K, 8K, 10K and 15K hubs. We observe that
for most graphs, the size of $|H^*|$ is a few times larger than the hub
number; however, for Orkut, LiveJournal, and Twitter, the hub-network
becomes quite large at 10K and 15K hubs. Also, in general, the degree of
hubs in the hub-network has been lowered, and on several graphs the
average degree is reduced to less than 1/3 of the original average
degree. We also observe that the ability to lower the degree is
correlated with the search performance: the more the hub degree is
lowered, the better the query performance improvement we can get from the
Hub-Network based bidirectional BFS.

8. CONCLUSION 

In this paper, we introduce a set of novel techniques centered on hubs
for $k$-degree shortest path computation in large social networks. The
Hub-Network and Hub²-Labeling algorithms can help significantly reduce
the search space. The extensive experimental study demonstrates that
these approaches can handle very large networks with millions of
vertices, and that their query processing is much faster than online
searching algorithms and Sketch-based approaches, the state-of-the-art
shortest path approximation algorithms. To the best of our knowledge,
this is the first practical study on computing exact shortest paths on
large social networks. In the future, we will study how to parallelize
the index construction and query answering process. We also plan to
investigate how to compute $k$-degree shortest paths on dynamic networks.